Multivariate data exploration
GEOG 30323
September 29, 2015
Why visualize data?
The greatest value of a picture is when it forces us to notice what we never expected to see.
- Tukey (1977) quoted in Yau (2013)
Exploring data visually
Source: Yau, Data Points p. 137
Our schedule:
- Current activities: data exploration through visualization with common chart types
- Weeks 10-15: deep dive into data visualization
- More complex chart types
- How to customize your seaborn plots
- Best practices in data visualization
- Interactive web-based graphics
- Maps!
Exploratory chart types
- Comparing categories: bar chart, dot plot
- Part-to-whole: pie chart
- Change over time: line chart
- Connections and relationships: scatter plot
Many, many more in these categories - these are just our focus for today!
Python and the web
- A brief aside: With Python, data on the web is at your fingertips (our topic for Week 9)
- This week, you will get a preview
import pandas as pd
mx_csv = 'http://personal.tcu.edu/kylewalker/mexico.csv'
mx = pd.read_csv(mx_csv)
mx.head()

Comparing categories
How about sorting our data?
mx_sorted = mx.sort_values('gdp08', ascending = False)
mx_sorted.head()

Bar charts
- Length or height of bars proportional to data values, allowing for comparisons between categories
- The value axis of bar charts must start at zero!!!
- Recommendation: sort your data values for ease of interpretation
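To see why the zero baseline matters, the arithmetic can be sketched in a few lines. The two values and the truncated baseline below are made up for illustration, and `bar_length_ratio` is a hypothetical helper, not part of pandas or seaborn.

```python
def bar_length_ratio(small, large, baseline=0.0):
    """Ratio of drawn bar lengths when the value axis starts at `baseline`."""
    return (large - baseline) / (small - baseline)

# Hypothetical values: two categories that differ by about 13%
small, large = 35.0, 39.6

honest = bar_length_ratio(small, large, baseline=0.0)
truncated = bar_length_ratio(small, large, baseline=34.0)

print(round(honest, 2))     # 1.13: bars drawn from zero
print(round(truncated, 2))  # 5.6: bars drawn from 34
```

With the axis starting at 34, the larger bar is drawn more than five times as long as the smaller one, even though the underlying values differ by only about 13%.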
Bar chart with non-zero origin
Source: Fox News via FlowingData.com
Bar charts in Python
%matplotlib inline
import seaborn as sns
mx.plot(x = 'name', y = 'gdp08', kind = 'bar')

Bar charts in seaborn
sns.set(style = 'whitegrid')
sns.barplot(x = 'gdp08', y = 'name', data = mx_sorted)

Dot plots
- Can be preferable to bar charts - values determined by position along axis rather than bar heights
- In turn, zero origin not strictly necessary (though consider the context)
- Sorted data also preferable for dot plots
Dot plots in seaborn
sns.stripplot(x = 'gdp08', y = 'name', data = mx_sorted)

Part-to-whole
- Categories in relationship to the entire population of values
- Examples: pie chart, waffle chart, 100% bar chart, tree map
- Must sum to 100%!
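A minimal sketch of preparing part-to-whole data, assuming a pandas Series of raw counts. The category names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical counts for four categories
counts = pd.Series({'A': 120, 'B': 60, 'C': 15, 'D': 5})

# Convert each count to its share of the total, in percent
shares = counts / counts.sum() * 100

print(shares)
print(shares.sum())  # shares of a whole should total 100
```

If the shares do not total 100, the categories either overlap or leave part of the population out, and a pie chart of them would be misleading.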
Pie charts in Python
zac = mx[mx.name == 'Zacatecas'].drop(['name', 'FID', 'gdp08', 'mus09'], axis = 1).squeeze()
zac.name = 'Zacatecas'
zac.plot(kind = 'pie', figsize = (6, 6))

Problems with pie charts
Source: Fox Chicago via FlowingData.com

Line charts in pandas
hs_drop = pd.read_csv('http://personal.tcu.edu/kylewalker/data/hs_drop.csv')
hs_drop.sort_values('year', inplace = True)
hs_drop.set_index('year', inplace = True)
hs_drop.plot() # pandas plotting defaults to line charts, infers x from index

Line charts in seaborn
- Connected points available in pointplot and catplot (called factorplot before seaborn 0.9)
- Requires long-form data! (More to come on this in the next two weeks)
hs_drop.reset_index(inplace = True)
hs_long = pd.melt(hs_drop, id_vars = 'year',
                  value_vars = ['m_rate', 'f_rate'],
                  value_name = 'percent_drop', var_name = 'gender')
# We use catplot (named factorplot before seaborn 0.9) because it gives us greater control over the axes
# kind = 'point' draws connected points; height replaces the older size argument
chart = sns.catplot(data = hs_long, x = 'year',
                    y = 'percent_drop', hue = 'gender',
                    kind = 'point', height = 8)
chart.set_xticklabels(rotation = 45, step = 3)
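The wide-to-long reshaping is easier to see on a toy table. The sketch below uses made-up numbers (not the actual hs_drop data) with the same column names, so the effect of `pd.melt` is visible without downloading anything.

```python
import pandas as pd

# Hypothetical wide-form data: one row per year, one column per group
wide_df = pd.DataFrame({
    'year': [2000, 2001],
    'm_rate': [12.0, 11.5],
    'f_rate': [10.0, 9.8],
})

# Long form: one row per (year, gender) pair, with the value in a single column
long_df = pd.melt(wide_df, id_vars='year',
                  value_vars=['m_rate', 'f_rate'],
                  var_name='gender', value_name='percent_drop')

print(wide_df.shape)  # (2, 3)
print(long_df.shape)  # (4, 3)
print(long_df)
```

Each rate column in the wide table becomes a set of rows in the long table, which is the layout seaborn expects when a variable such as gender is mapped to hue.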

Scatter plots
- Question: how do the values in two columns covary?
- Scatter plot: each observation represented by a point; position along x axis dictated by one column value; position along y axis dictated by other column value
- Regression line: visual representation of estimated statistical relationship between X and Y
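As a rough sketch of what the regression line represents, a straight line can be fit by least squares with `numpy.polyfit`. The x and y values are invented for illustration and are not from the Mexico dataset.

```python
import numpy as np

# Hypothetical paired observations
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# A degree-1 polynomial fit returns the slope and intercept
# of the least-squares line
slope, intercept = np.polyfit(x, y, 1)

# Predicted y values along the fitted line
y_hat = slope * x + intercept

print(round(slope, 2), round(intercept, 2))
```

The slope estimates how much y changes, on average, for a one-unit increase in x. The regression line drawn on a scatter plot is a picture of this fitted relationship.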
Scatter plots in pandas
mx.plot(x = 'mus09', y = 'pri10', kind = 'scatter')

Scatter plots in seaborn
- Available in the lmplot and regplot functions
sns.lmplot(data = mx, x = 'mus09', y = 'pri10')

Correlation
- Correlation coefficient: statistical measure of how strongly two variables covary linearly; ranges from -1 (perfect negative correlation) through 0 (no linear relationship) to +1 (perfect positive correlation)
- In pandas: .corr()
- Beware of spurious correlations! http://tylervigen.com/spurious-correlations
mx['mus09'].corr(mx['pri10'])
0.41639990565936902 # the result
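By default, the pandas .corr() method computes the Pearson coefficient. A minimal sketch of that calculation in plain Python follows, using invented numbers. `pearson_r` is a hypothetical helper written for this illustration, not a pandas function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data
xs = [1, 2, 3, 4, 5]
up = [2, 4, 6, 8, 10]      # rises in lockstep with xs
down = [10, 8, 6, 4, 2]    # falls in lockstep with xs

print(pearson_r(xs, up))    # 1.0
print(pearson_r(xs, down))  # -1.0
```

Real data rarely reach either extreme. A value such as 0.42 indicates a moderate positive linear association.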